`DataFrame` data structure
The DataFrame data structure is the heart of the pandas library. It is the primary object used in data analysis and cleaning tasks. Conceptually, a DataFrame is a two-dimensional Series-like object with an index and multiple columns of content, each column having a label. In other words, a DataFrame is a labeled array with two axes: a row index and column labels.
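As a minimal sketch of the two labeled axes, the snippet below builds a small DataFrame and inspects its row index, column labels, and shape. The data and the name `tiny` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two rows, two labeled columns
tiny = pd.DataFrame({'Name': ['Alice', 'Jack'], 'Score': [85, 82]},
                    index=['school1', 'school2'])

print(tiny.index.tolist())    # row labels (axis 0)
print(tiny.columns.tolist())  # column labels (axis 1)
print(tiny.shape)             # (number of rows, number of columns)
```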
Creating a DataFrame
Importing Pandas
import pandas as pd
Creating DataFrame from Series
Example of creating three school records for students and their class grades using pd.Series:
record1 = pd.Series({'Name': 'Alice', 'Class': 'Physics', 'Score': 85})
record2 = pd.Series({'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82})
record3 = pd.Series({'Name': 'Helen', 'Class': 'Biology', 'Score': 90})
df = pd.DataFrame([record1, record2, record3], index=['school1', 'school2', 'school1'])
df.head()
This creates a DataFrame in which each Series becomes one row, and the Series keys become the column labels. The head() method shows the first five rows of the DataFrame by default.
Creating DataFrame from List of Dictionaries
An alternative method using a list of dictionaries:
students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
{'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
{'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])
df.head()
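The two construction routes above should give the same table. The sketch below checks that by comparing column labels and cell values; the variable names are illustrative, and column dtypes are deliberately not compared because they may differ between the two routes.

```python
import pandas as pd

index = ['school1', 'school2', 'school1']
rows = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
        {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
        {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]

# Same records, built once from Series and once from plain dictionaries
df_from_series = pd.DataFrame([pd.Series(r) for r in rows], index=index)
df_from_dicts = pd.DataFrame(rows, index=index)

# Compare the set of column labels
same_columns = set(df_from_series.columns) == set(df_from_dicts.columns)

# Align the column order before comparing the cell values
aligned = df_from_series[list(df_from_dicts.columns)]
same_values = aligned.values.tolist() == df_from_dicts.values.tolist()

print(same_columns, same_values)
```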
Extracting Data from DataFrame
Using .loc and .iloc
- Single Row Selection: Using .loc with one parameter returns a Series.
df.loc['school2']
type(df.loc['school2'])
- Multiple Rows Selection: Using .loc with a non-unique index returns a DataFrame.
df.loc['school1']
type(df.loc['school1'])
- Selecting Specific Column for Specific Rows:
df.loc['school1', 'Name']
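The list above only exercises .loc, which selects by label. As a rough companion sketch, the snippet below repeats the selections and adds .iloc, which selects by integer position. The variable names are illustrative only.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

row_by_label = df.loc['school2']     # unique label -> Series
row_by_position = df.iloc[1]         # second row by position -> Series
rows_by_label = df.loc['school1']    # repeated label -> DataFrame
names = df.loc['school1', 'Name']    # repeated label, one column -> Series
cell = df.iloc[0, 0]                 # first row, first column -> scalar

print(type(row_by_label).__name__, type(rows_by_label).__name__)
```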
Using Transpose (.T)
Transposing the DataFrame swaps rows and columns, so a column label can then be selected with .loc:
df.T.loc['Name']
Column Selection
Directly using the indexing operator for column selection:
df['Name']
type(df['Name'])
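The sketch below compares the two column-selection routes shown so far, the transpose followed by .loc and the indexing operator. It also shows that passing a list of labels to the indexing operator returns a DataFrame rather than a Series. The variable names are illustrative.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

via_transpose = df.T.loc['Name']   # column reached as a row of the transpose
via_operator = df['Name']          # same column via the indexing operator
as_frame = df[['Name']]            # list of labels -> one-column DataFrame

print(via_transpose.tolist() == via_operator.tolist())
print(type(via_operator).__name__, type(as_frame).__name__)
```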
Avoid Chaining Operations
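Chaining indexing operations (one lookup followed by another) can cause pandas to return a copy instead of a view, which can be slower and can make later modifications silently fail. The chained form below works for reading, but df.loc['school1', 'Name'], which passes both the row and column labels to a single .loc call, is more efficient and clearer:
# Chained form (avoid, especially when assigning)
df.loc['school1']['Name']
print(type(df.loc['school1'])) # DataFrame
print(type(df.loc['school1']['Name'])) # Series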
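As a small sketch of the recommended form, the snippet below reads the same values with a chained lookup and with a single .loc call, then assigns through the single .loc call so that the change lands in df itself. The replacement score of 100 is an arbitrary illustrative value.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

chained = df.loc['school1']['Name']   # two lookups, intermediate DataFrame
direct = df.loc['school1', 'Name']    # one lookup, rows and column together

# Both forms read the same values
print(chained.tolist() == direct.tolist())

# Assigning through a single .loc call updates df directly
df.loc['school1', 'Score'] = 100
print(df['Score'].tolist())
```

Only the single .loc call is used for the assignment here; assigning through the chained form may write to a temporary copy and leave df unchanged.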
Slicing and Selecting Multiple Columns
Using .loc to select all rows and specific columns:
df.loc[:, ['Name', 'Score']]
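A brief sketch of the same idea follows: the colon selects every row, and the column part can be either a list of labels or a label slice. Note that label slices in .loc include both endpoints. The variable names are illustrative.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

by_list = df.loc[:, ['Name', 'Score']]   # all rows, two named columns
by_slice = df.loc[:, 'Name':'Class']     # all rows, columns Name through Class
shorthand = df[['Name', 'Score']]        # indexing operator with a list

print(by_list.columns.tolist())
print(by_slice.columns.tolist())
```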
Dropping Data
Using .drop()
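Drop rows or columns from a DataFrame. By default, drop() returns a new DataFrame and leaves the original unchanged; pass inplace=True to modify the original instead.
df.drop('school1')  # returns a copy without the 'school1' rows
df  # the original still contains all rows
# With inplace=True, df itself is modified
df.drop('school1', inplace=True)
df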
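The difference between the two calls can be checked by counting rows, as in this minimal sketch. The names `dropped`, `rows_before`, and `rows_after` are illustrative.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

dropped = df.drop('school1')       # new DataFrame; df is untouched
rows_before = len(df)

df.drop('school1', inplace=True)   # modifies df and returns None
rows_after = len(df)

print(len(dropped), rows_before, rows_after)
```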
Dropping Columns
Two methods to drop columns:
- Using .drop() with axis parameter:
copy_df = df.copy()
copy_df.drop("Name", inplace=True, axis=1)
- Using del keyword:
del copy_df['Class']
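As a sketch of both approaches on a fresh frame, the snippet below uses the columns keyword of drop(), which is equivalent to passing axis=1, alongside del. Working on a copy keeps the original df intact. The variable names are illustrative.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

copy_df = df.copy()

# drop() with columns= returns a new DataFrame without the named column
without_name = copy_df.drop(columns=['Name'])

# del removes the column from copy_df itself and returns nothing
del copy_df['Class']

print(without_name.columns.tolist())
print(copy_df.columns.tolist())
print(df.columns.tolist())
```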
Adding New Columns
Adding a new column by assigning a value; a scalar such as None is broadcast to every row:
df['ClassRanking'] = None
df
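Besides a single scalar, a new column can be assigned from a list with one entry per row or computed from an existing column. The sketch below shows all three; the column names `Grade` and `Passed`, the grade letters, and the threshold of 85 are made up for illustration.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

df['ClassRanking'] = None          # scalar: broadcast to every row
df['Grade'] = ['B', 'B', 'A']      # list: one value per row, in row order
df['Passed'] = df['Score'] >= 85   # computed from an existing column

print(df.columns.tolist())
```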